Take-home_Ex03

Author

NeoYX

Published

May 28, 2023

Modified

June 12, 2023

Vast Challenge 2023 Mini Challenge 3 (Subtask: 3)

Tasks and Questions:

Use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.

  1. Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.

  2. Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user. Limit your response to 400 words and 5 images.

  3. Measure similarity of businesses that you group in the previous question. Express confidence in your groupings visually. Limit your response to 400 words and 4 images.

  4. Based on your visualizations, provide evidence for or against the case that anomalous companies are involved in illegal fishing. Which business groups should FishEye investigate further? Limit your response to 600 words and 6 images.

  5. Reflection: What was the most difficult aspect of working with this knowledge graph? Did you have the tools and resources you needed to complete the challenge? What additional resources would have helped you? Limit your response to 300 words

1 About the dataset

1.1 Data dictionary

Node Attributes:

• type – Type of node as defined above.

• country – Country associated with the entity. This can be a full country or a two-letter country code.

• product_services – Description of product services that the “id” node does.

• revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.

• id – Identifier of the node is also the name of the entry.

• role – The subset of the “type” node, not in every node attribute.

• dataset – Always “MC3”.

Edge Attributes:

• type – Type of the edge as defined above.

• source – ID of the source node.

• target – ID of the target node.

• dataset – Always “MC3”.

• role - The subset of the “type” node, not in every edge attribute.

1.2 Importing the datasets

Import libraries

The new libraries used today are :

  • jsonlite to import json file
  • tidytext is to do basic text mining in R
Show the code
pacman::p_load(jsonlite, igraph, tidygraph, ggraph, DT, visNetwork, lubridate, clock, tidyverse, graphlayouts,knitr,plotly, ggiraph, ggstatsplot,
ggHoriPlot, ggthemes,hrbrthemes,treemap,patchwork, ggiraph,proxy, tidytext, skimr,
GGally, parallelPlot, tidyverse)

Load the MC3 dataset

Show the code
mc3_data <- fromJSON("C:/yixin-neo/ISSS608-VAA/Project/data/mc3.json")

Extracting edges

The code chunk below will be used to extract the links data.frame of mc3_data and save it as a tibble data.frame called mc3_edges.

Show the code
mc3_edges <- as_tibble(mc3_data$links) %>% 
  distinct() %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type = as.character(type)) %>%
  group_by(source, target, type) %>%
    summarise(weights = n()) %>%
  filter(source!=target) %>%
  ungroup()
Note
  • distinct() is used to ensure that there will be no duplicated records.

  • mutate() and as.character() are used to convert the field data type from list to character.

  • group_by() and summarise() are used to count the number of unique links.

  • the filter(source!=target) is to ensure that no record with similar source and target.

There are no duplicates in the mc3_edges dataframe.

Show the code
sum(duplicated(mc3_edges))
[1] 0

Extracting nodes

The code chunk below will be used to extract the nodes data.frame of mc3_data and save it as a tibble data.frame called mc3_nodes.

Show the code
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
Note
  • mutate() and as.character() are used to convert the field data type from list to character.

  • To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.

  • select() is used to re-organise the order of the fields.

Check for duplicates in mc3_nodes dataframe across all columns and remove them.

Show the code
#sum(duplicated(mc3_nodes))
#mc3_nodes[duplicated(mc3_nodes), ]

# Remove duplicates based on all columns
mc3_nodes <- mc3_nodes[!duplicated(mc3_nodes), ]

2 Initial Data Exploration

2.1 Exploring the edges data frame

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.

Show the code
skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

The report above reveals that there are no missing values in all fields.

Why are there source columns with maximum character length of 700?

Checking for the longest length value in source column of edge file.

Show the code
# Find the index of the value with the maximum length
max_index <- which.max(nchar(mc3_edges$source))

# Extract the value with the maximum length in `source` column
max_value <- mc3_edges$source[max_index]
max_value
[1] "c(\"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \n\"Baltic Sprat Incorporated Investment\", \"Baltic Sprat Incorporated Investment\", \"Water World Limited Liability Company Freight\", \"Water World Limited Liability Company Freight\")"

It seems like we found more issue with the mc3_edge dataframe. Some of the source column values are still in a list format. We need to unlist the source companies that are hidden.

2.1.1 Cleaning the edges data frame

Now, I would like to split into mc3_edges dataframe into two dataframes, where

df1 : containing rows where the source does NOT contain “c(” <- actual source value, no further processing needed

df2 : containing rows where the source contains values starting with “c(” . As the source column still contains inner list of source entities , it still needs to be unlisted further.

Show the code
# Creating an empty data frame for the two subsets
df1 <- data.frame()
df2 <- data.frame()

# Looping through each row of the original data frame
for (i in 1:nrow(mc3_edges)) {
  if (!grepl("c\\(", mc3_edges$source[i])) {
    # Append the row to df1 if source does not contain "c("
    df1 <- rbind(df1, mc3_edges[i, ])
  } else {
    # Append the row to df2 if source contains "c("
    df2 <- rbind(df2, mc3_edges[i, ])
  }
}

df1 dataframe

source target type weights
1 AS Marine sanctuary Christina Taylor Company Contacts 1
1 AS Marine sanctuary Debbie Sanders Beneficial Owner 1
1 Ltd. Liability Co Cargo Angela Smith Beneficial Owner 1
1 S.A. de C.V. Catherine Cox Company Contacts 1
1 and Sagl Forwading Angela Mendoza Company Contacts 1

df2 dataframe

source target type weights
c(“1 Ltd. Liability Co”, “1 Ltd. Liability Co”) Yesenia Oliver Company Contacts 1
c(“1 Swordfish Ltd Solutions”, “1 Swordfish Ltd Solutions”, “Saharan Coast BV Marine”, “Olas del Sur Estuary”) Daniel Reese Company Contacts 1
c(“5 Limited Liability Company”, “Bahía de Coral Kga”) Brittany Jones Beneficial Owner 1
c(“5 Limited Liability Company”, “Bahía de Coral Kga”) Elizabeth Torres Beneficial Owner 1
c(“5 Limited Liability Company”, “Bahía de Coral Kga”) Sandra Roberts Company Contacts 1

We would need to clean up the df2 dataframe’s source column.

The code chunk below uses the

  • gsub function to remove the unwanted characters from the source column. It replaces any occurrence of " or c( or ) with an empty string, ’’

  • trimws function to remove any leading or trailing whitespace from the cleaned source column,

  • \\: The backslash is an escape character in regular expressions. In this case, it is used to escape the closing parenthesis character, so it is treated as a literal character in the pattern.

Show the code
df2$source <- gsub('["c(\\)]', '', df2$source)
df2$source <- trimws(df2$source)
kable(head(df2,5))
source target type weights
1 Ltd. Liability Co, 1 Ltd. Liability Co Yesenia Oliver Company Contacts 1
1 Swordfish Ltd Solutions, 1 Swordfish Ltd Solutions, Saharan Coast BV Marine, Olas del Sur Estuary Daniel Reese Company Contacts 1
5 Limited Liability Company, Bahía de Coral Kga Brittany Jones Beneficial Owner 1
5 Limited Liability Company, Bahía de Coral Kga Elizabeth Torres Beneficial Owner 1
5 Limited Liability Company, Bahía de Coral Kga Sandra Roberts Company Contacts 1

For each row, extract each entity in Source column and insert this entity as a new row.

Show the code
# Create a new data frame for the modified rows
df2_modified <- data.frame()

# Loop through each row of df2
for (i in 1:nrow(df2)) {
  # Split the source value by comma
  source_values <- unlist(strsplit(df2$source[i], ", "))
  
  # Create a new row for each source value
  for (value in source_values) {
    # Create a new row with the same "target", "type", and "weights" values
    new_row <- data.frame(
      source = value,
      target = df2$target[i],
      type = df2$type[i],
      weights = df2$weights[i]
    )
    
    # Append the new row to df2_modified
    df2_modified <- rbind(df2_modified, new_row)
  }
}

# Print the modified data frame
#cat("df2_modified:\n")
kable(head(df2_modified,10))
source target type weights
1 Ltd. Liability Co Yesenia Oliver Company Contacts 1
1 Ltd. Liability Co Yesenia Oliver Company Contacts 1
1 Swordfish Ltd Solutions Daniel Reese Company Contacts 1
1 Swordfish Ltd Solutions Daniel Reese Company Contacts 1
Saharan Coast BV Marine Daniel Reese Company Contacts 1
Olas del Sur Estuary Daniel Reese Company Contacts 1
5 Limited Liability Company Brittany Jones Beneficial Owner 1
Bahía de Coral Kga Brittany Jones Beneficial Owner 1
5 Limited Liability Company Elizabeth Torres Beneficial Owner 1
Bahía de Coral Kga Elizabeth Torres Beneficial Owner 1

Question to think about: ‘Do we aggregate the weights for each ’source’ -‘target’ pair or treat them as duplicates?

For now, let us aggregate and store the number of occurrence in ‘weights’ column. Now we have df2_modified dataframe where each ‘source-target’ pair is unique and have no duplicates.

Show the code
df2_modified <- df2_modified %>% 
  group_by(source, target, type) %>% 
  summarise(weights = sum(weights)) %>% 
  ungroup() %>% 
  arrange(desc(weights))

kable(head(df2_modified))
source target type weights
Niger River Delta S.p.A. Cole Allen Company Contacts 20
Niger River Delta S.p.A. Shawn Myers Company Contacts 20
Balti Sprat Inorporated Investment Jose Ramirez Company Contacts 14
Cape Verde Islands Pl Otter Duane Edwards Company Contacts 14
Cape Verde Islands Pl Otter Jill Newman Beneficial Owner 14
Greek Makerel Ltd. Corporation Express Megan Wyatt Company Contacts 13

Finally, we will appended df1 below df2_modified to get our edges_combined data frame, now with 24,935 rows after unlisting all source entities.

In the code chunk below, datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.

Show the code
# Append df1 to df2_modified
edges_combined <- rbind(df2_modified, df1)

# Print the combined data frame
DT::datatable(edges_combined)

Let us use bar chart to visualise the type column. (Improve on this bar chart later)

Show the code
ggplot(data = edges_combined,
       aes(x = type)) +
  geom_bar()

The edges_combined df contains relationship between individuals and companies, and describe whether an individual is a Beneficial Owner or a Company Contact of a company.

2.1.2 Build a node dataframe using new edge dataframe

Rebuild a brand new mc3_nodes1 dataframe (enhanced version of mc3_nodes dataframe), by appending unique target and source entities from the edges_combined dataframe to get a column id.

This newly created dataframe only has 1 column id. We can bring in nodes attributes like country, type, revenue_omu, product_services from the original mc3_nodes dataframe into the mc3_nodes1 dataframe.

Show the code
id1 <- edges_combined %>%
  select(source) %>%
  rename(id = source)
id2 <- edges_combined %>%
  select(target) %>%
  rename(id = target)
mc3_nodes1 <- rbind(id1, id2) %>%
  distinct() %>%                  #<< ensures no duplicated in `id` column
  left_join(mc3_nodes, by='id',
            unmatched = "drop")  #<<< bring in 4 node attributes from `mc3_nodes` df

After the left join, mc3_nodes1 df increased in number of rows from 34,443 to 35,767 rows because entities like Adams LLC has both company contacts and beneficial owner types in mc3_nodes df.

Let us take a sneak peak at the mc3_nodes1 df now. We notice the following:

  • ‘Andrew Yu’ does not have value in the type column. However, its type value can be retrieved from the edges_combined df.

  • ‘Andrews PLC’ as a single entity has two types, namely ‘Company Contacts’ and ‘Beneficial Owners’.

Show the code
mc3_nodes1 %>% 
  arrange(id) %>% 
  slice(1563:1570) %>%
  kable()
id country type revenue_omu product_services
Andrew Woodward NA NA NA NA
Andrew Yu NA NA NA NA
Andrews LLC ZH Company 72609.01 Food preparations and kindred products
Andrews Ltd ZH Company Contacts NA character(0)
Andrews PLC ZH Beneficial Owner NA character(0)
Andrews PLC ZH Company Contacts NA character(0)
Andrews and Sons ZH Company NA Unknown
Andrews and Sons ZH Company Contacts NA character(0)

Looking into the edges_combined df where the type attribute of ‘Andrew Yu’ is stored.

He is a beneficial owner of ‘Mar del Paraíso GmbH’.

Show the code
edges_combined %>% 
  filter(grepl('Andrew Yu', target)) %>% 
  kable()
source target type weights
Mar del Paraíso GmbH Andrew Yu Beneficial Owner 1

We would like to ingest his relationship type with ‘Mar del Paraíso GmbH’ into the mc3_nodes1 dataframe.

We can further enrich each id’s type attribute of the mc3_nodes1 dataframe by extracting the type values of the target entity from the edges_combined dataframe. The number of records further increased from 35,767 to 39,437 because each target entity in edges_combined df can have more than 1 type. An example is ‘Aaron Garcia’ - who is both a company contacts and a beneficial owner type to a company.

Show the code
# Perform a left join to bring in the "type" values from 'edges_combined` df to 'mc3_nodes1` df

merged_df <- left_join(mc3_nodes1, edges_combined, by = c("id" = "target"))

# Replace NA values in the "type.x" column with non-NA values from the "type.y" column
merged_df$type.x[is.na(merged_df$type.x)] <- merged_df$type.y[is.na(merged_df$type.x)]

# Select the relevant columns and rename "type.x" to "type"
merged_df <- merged_df %>%
  select(id, country, type = type.x, revenue_omu, product_services)
Note

The code chunk above is used to replace the NA values in the “type.x” column of the merged_df dataframe with non-NA values from the “type.y” column.

  • merged_df$type.x refers to the “type.x” column of the merged_df dataframe. This is the column where we want to replace the NA values.

  • is.na(merged_df$type.x) creates a logical vector with TRUE for NA values and FALSE for non-NA values in the “type.x” column.

  • merged_df$type.y[is.na(merged_df$type.x)] uses the logical vector as an index to select only the non-NA values from the “type.y” column corresponding to the NA values in the “type.x” column.

Note that the type attribute of ‘Andrew Yu’ and similar others have been added to this df.

id country type revenue_omu product_services
Andrew Woodward NA Beneficial Owner NA NA
Andrew Yu NA Beneficial Owner NA NA
Andrews LLC ZH Company 72609.01 Food preparations and kindred products
Andrews Ltd ZH Company Contacts NA character(0)
Andrews PLC ZH Beneficial Owner NA character(0)
Andrews PLC ZH Company Contacts NA character(0)
Andrews and Sons ZH Company NA Unknown
Andrews and Sons ZH Company Contacts NA character(0)

Check for duplicates. There are once again duplicates after mc3_nodes1 left outer join with edges_combined df because in edges_combined df, there are many relationship types for each individual and its associated company. (E.g check Cole Allen).

We will drop the duplicates in merged_df df, leaving us with 36,731 rows. merged_df is now an enhanced version of my main nodes file. It contains nodes (seafood and non-seafood related) that are present in the edges_combined df. In addition, node attribute type column was also enhanced into this df from edges_combined df. Using datatable() of DT pacakge, let us take a look at merged_df as an interactive table on the html document using.

Show the code
merged_df <- merged_df[!duplicated(merged_df), ]  # 36,731 rows
DT::datatable(merged_df)

Finally, the nodes and edges dataframes are ready! We will use the merged_df as our main nodes file and edges_combined as main edges file from now. Before further cleaning, with what we have now, lets build an initial visualisation of the network.

2.1.3 Building network model with tidygraph

First, create a graph object using tbl_graph() function. Then calculate betweenness and closeness centrality scores.

Show the code
mc3_graph <- tbl_graph(nodes = merged_df,
                       edges = edges_combined,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness())

mc3_graph <- mc3_graph %>% 
  mutate(membership = components(mc3_graph)$membership)

2.1.4 Visualising network graph

We will filter nodes with high betweenness centrality scores (>2,000,000) and visualise them to see the relationships that they have.

Show the code
set.seed(1234)
mc3_graph %>%
  filter(betweenness_centrality >= 2000000) %>%
ggraph(layout = "fr") +
  geom_edge_link(aes(#width= weights,
                     alpha=0.5)) +
  geom_node_point(aes(
    size = betweenness_centrality,
    #color = type,
    alpha = 0.3)) +
  geom_node_label(aes(label = id),repel=TRUE, size=2.5, alpha = 0.8) +
  scale_size_continuous(range=c(1,10)) +
  theme_graph() +
  labs(title = 'Initial network visualisation \n(Seafood and non-seafood)',
       subtitle = 'Entities with betweenness scores > 2,000,000')

Below is a dataframe showing us the top 10 entities with the highest betweenness scores.

Show the code
mc3_graph %>%  activate(nodes) %>%  as_tibble() %>% arrange(desc(betweenness_centrality)) %>% slice(1:10) %>% kable() 
id country type revenue_omu product_services betweenness_centrality closeness_centrality membership
Wave Warriors S.A. de C.V. Express Lumindoria Company 1761580.75 Beer, ale, and malt liquors, as well as nonalcoholic beer and other related products 3981896 1.31e-05 6
Dutch Oyster Sagl Cruise ship Marebak Company 62913.42 Import and export of textiles, knitwear and raw materials 3878339 1.34e-05 6
Senegal Coast Ltd. Liability Co Oceanus Company NA Unknown 3725057 1.34e-05 6
Limpopo River Ltd. Liability Co Marebak Company 12065.26 Meat and processed meat products 3632403 1.30e-05 6
Ocean Observers Marine mist Puerto Sol Company 39678.54 Transportation and other related services 3559968 1.24e-05 6
Matthew Reynolds NA Company Contacts NA NA 3484895 1.31e-05 6
Niger Bend AS Express Puerto Sol Company 613590.73 Cruise ship holidays 3479011 1.32e-05 6
Luangwa River Limited Liability Company Holdings Sol y Oceana Company NA Chemicals and allied products, such as acids, industrial and heavy chemicals, dyestuffs, industrial salts, rosin, and turpentine 3305332 1.23e-05 6
Jennifer Smith NA Beneficial Owner NA NA 3142993 1.25e-05 6
Coral del Mar SE United Oceanus Company NA Unknown 3133167 1.34e-05 6

The top 10 betweenness entities above are not dealing with seafood related industries directly. In the next section, we will filter entities from the merged_df dataframe for only seafood related entities. We may revisit the non-seafood entities later when we have specific targets/companies to investigate.

HIVE plot (Later )

3 Exploring the merged_df nodes data frame

In the code chunk below, skim() of skimr package is used to display the summary statistics of merged_df tibble data frame. We will not be exploring mc3_nodes dataframe as merged_df dataframe is an enhanced version of it.

Show the code
skim(merged_df)
Data summary
Name merged_df
Number of rows 36731
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 6 64 0 34442 0
country 30045 0.18 2 14 0 85 0
type 7816 0.79 7 16 0 3 0
product_services 30045 0.18 4 1737 0 2052 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 33158 0.1 877262.3 11772097 3652.23 8375.22 17466.67 50229.65 310612303 ▇▁▁▁▁

The report above reveals that there are many missing values in country, type, product_services columns. In addition, that there are several ‘Unknown’ values in the product_services column, which is a good as missing. This section focuses on cleaning the product_services columns by first performing text sensing and adding a label to each company as either ‘seafood’, ‘others’ or ‘unknown’.

The report above also reveal that there are 91% Na values in the revenue_omu column.

Check the distribution of the type column in merged_df dataframe.

Show the code
merged_df$type1 <- ifelse(is.na(merged_df$type), "Missing", merged_df$type)
merged_df$type1 <- as.factor(merged_df$type1)


ggplot(data = merged_df,
       aes(x = reorder(type1,type1,
                     function(x)+length(x)))) +
  geom_bar(fill='lightblue') +
  ylim(0,20000) +
  geom_text(stat="count", 
            aes(label=paste0(..count.., ", ", 
                             round(..count../sum(..count..)*100, 1), "%")),
            hjust=-0.1,
            size = 5) +
  coord_flip() +
  theme_minimal() +
  labs(x = 'Relationship type between entities',
       title = 'Distribution of business relationships') +
  theme(plot.title = element_text(size = 22,
                                  face='bold', 
                                  hjust = 0.5),
        axis.title.x=element_text(size= 20,
                                  hjust = 0.5),
        axis.title.y=element_text(size= 20),
        axis.text.x = element_text(size = 15, 
                                  color = "black", 
                                  angle = 0, 
                                  hjust = 0.5, 
                                  vjust = 0.5),
        axis.text.y = element_text(size = 15, 
                                  color = "black", 
                                  angle = 0, 
                                  hjust = 0.5, 
                                  vjust = 0.5),
        panel.grid.major.y = element_blank() )

3.1 Text Sensing with tidytext

In section 2.1.2, we saw in merged_df dataframe that the product_services column contains raw text data on products that each entity provides. We would like to give each company a meaningful label based on its product_services.

Hence in this section, we will perform basic text sensing using appropriate functions of tidytext package.

3.1.1 Simple word count

The code chunk below calculates number of times the word fish appeared in the field product_services.

Show the code
merged_df %>% 
    mutate(n_fish = str_count(product_services, "fish")) %>% arrange(desc(n_fish)) %>% head()
# A tibble: 6 × 7
  id                     country type  revenue_omu product_services type1 n_fish
  <chr>                  <chr>   <chr>       <dbl> <chr>            <fct>  <int>
1 Gvardeysk Sextant ОАО… Uziland Comp…      73027. Fish salads (It… Comp…     11
2 Taylor LLC             ZH      Comp…     138982. Fish (anchovy, … Comp…     11
3 SeaSelect Foods Salt … Marebak Comp…      41902. European whole … Comp…      7
4 Samaka Chart ОАО Deli… Nalako… Comp…      16207. Live crayfish, … Comp…      6
5 suō yú Ltd. Liability… Coralm… Comp…      31567. Offers a wide r… Comp…      6
6 Arunachal Pradesh s S… Marebak Comp…      60346. Offers a wide r… Comp…      6

3.1.2 Tokenisation

The word tokenisation have different meaning in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for the processes like parsing and text mining.

In the code chunk below, unnest_token() of tidytext is used to split text in product_services columninto words.

Show the code
token_nodes <- merged_df %>%
  mutate(product_services = ifelse(is.na(product_services), 'unknown', product_services)) %>% 
  unnest_tokens(word, 
                product_services)
Note
  • The two basic arguments to unnest_tokens() used here are column names.

  • First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).

  • Before tokenising, the NA values under product_services have been replaced by ‘unknown’ string.

  • By default, punctuation has been stripped.

  • By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).

3.1.3 Removing stopwords

We will use the tidytext package’s function called stop_words that will help us clean up stop words. In addition, we can add in additional stopwords in the stopwords list.

Show the code
# Create a new dataframe with customised stopwords
new_stopwords <- data.frame(word = c("unknown", "character", "0", "products","range", "offers","including"))

# Add the new stopwords to the existing stop_words dataframe
stop_words <- bind_rows(stop_words, new_stopwords)

# remove stopwords from `token_nodes`
stopwords_removed <- token_nodes %>% 
  anti_join(stop_words)
Note

There are two processes:

  • Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.

  • Then anti_join() of dplyr package is used to remove all stop words from the analysis.

Now we can visualise the words extracted by using the code chunk below.

Show the code
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Count",
      y = "Unique words",
      title = "Count of unique words found in product_services field") +
   theme(axis.text.y = element_text(size = 7.5, hjust = 0))

The diagram above shows that some words like ‘fish’, ‘seafood’, ‘salmon’ are related to the seafood industry and we could tag companies containing these words to the label ‘seafood’. The non-seafood related industry companies could be tag to the label ‘others’ while missing product services could be tag to ‘unknown’.

3.2 Preparing a master node dataframe with unique id per row

3.2.1 Adding a label column using product_services column

Creating nodes_all_notagg dataframe where this df contains a new column called label that groups all the fishing related companies together and non-seafood related companies together. There are three categories in label column namely ‘seafood’, ’ others’, ‘unknown’. nodes_all_notagg dataframe is created from merged_df.

Note that in nodes_all_notagg dataframe , there can be duplicated ids bacause an entity can be associated to more than 1 country, 1 relationship type and 1 product labels group.

(Refer to ‘Manipur Market Ltd. Liability Co’ if needed)

Show the code
#library(stringr)

# Define the seafood keywords
seafood_keywords <- c("sea food", "seafood", "fish", "prawn", "shrimp", "shell", "crab", "lobster", "mussel", "cavier", "oyster", "octopus", "squid", "aquatic", "crayfish", "tuna", "salmon", "scallop", "mackerel", "trout", "sardine", "winkle", "Barramundi", "tilapia")

# Add the 'label' column based on the 'product_services' column
nodes_all_notagg <- merged_df %>%
  mutate(product_services = ifelse(is.na(product_services), 'Unknown', product_services)) %>% 
  mutate(label = ifelse(product_services == 'Unknown' | product_services == 'character(0)', 'unknown',
                        ifelse(str_detect(product_services, regex(paste(seafood_keywords, collapse = "|"), ignore_case = TRUE)), 'seafood', 'others')))

In the code chunk above, we first check if product_services is equal to either “Unknown” or “character(0), if true, ‘label’ is set to ‘unknown’.

Next, if product_services contains seafood_keywords, then ‘label’ is set to ‘seafood’.

If none of the previous conditions are met, ‘label’ is set to ‘others’.

Note

nodes_all_notagg df can be used with edges_combined edge file

Visualising the newly created label column using bar charts. The results are as expected because most of the Beneficial Owners and Company Contacts do not have values under the product_services column.

Show the code
ggplot(data = nodes_all_notagg,
       aes(x = reorder(label,label,
                       function(x) + length(x)))) +
  geom_bar(fill='lightblue') +
  ylim(0,50000) +
  coord_flip() +
  geom_text(stat="count", 
            aes(label=paste0(..count.., ", ", 
                             round(..count../sum(..count..)*100, 1), "%")),
            hjust= -0.1,
            size = 5) +

  theme_minimal() +
  
  labs(x = 'Product services labels',
       title = 'Distribution of labels') +
  theme(plot.title = element_text(size = 22,
                                  face='bold', 
                                  hjust = 0.5),
        axis.title.x=element_text(size= 20,
                                  hjust = 0.5),
        axis.title.y=element_text(size= 20),
        axis.text.x = element_text(size = 15, 
                                  color = "black", 
                                  angle = 0, 
                                  hjust = 0.5, 
                                  vjust = 0.5),
        axis.text.y = element_text(size = 15, 
                                  color = "black", 
                                  angle = 0, 
                                  hjust = 0.5, 
                                  vjust = 0.5),
        panel.grid.major.y = element_blank() )

3.2.2 Aggregation to ensure no duplicates of id in master node df

As explained earlier , the entity names in the id column of the nodes_all_notagg dataframe is not unique.

Thus from nodes_all_notagg dataframe, we will create an aggregated version of it called nodes_all_agg dataframe (our master node df) , where there is only 1 instance of each id. It would mean aggregation is needed in order not to lose any information.

First, group by id and create new columns :

country_qty : number of countries that an entity is associated with

type_qty : number of business relationship types an entity has

revenue_sum: total revenue of entity

label_qty: number of categories of products associated with entity

country_concat : list of all countries associated with entity

type_concat : list of all business relationship types associated with entity

label_concat : list of all product labels associated with entity

Show the code
 nodes_all_agg <- nodes_all_notagg %>%
  group_by(id) %>%
  summarise(country_qty = ifelse(all(is.na(country)), 0, n_distinct(country)),
            type_qty = ifelse(all(is.na(type)), 0, n_distinct(type)),
            label_qty = n_distinct(label),
            revenue_sum = sum(revenue_omu, na.rm=TRUE),
            country_concat = ifelse(country_qty > 1, paste(unique(country), collapse = ', '), country[1]),
            type_concat = ifelse(type_qty > 1, paste(unique(type), collapse = ', '), type[1]),
            label_concat = ifelse(label_qty > 1, paste(unique(label), collapse = ', '), label[1]),
            product_services_concat = paste(product_services, collapse = "| ")
            )
Note

The codes above checks for each entity

  • country values contains all null. If so, country_qyt column is 0. Otherwise, the number of distinct countries is computed using n_distinct().

  • the number of distinct countries (country_qty) is greater than 1. If it is, it uses paste() with unique(country) to concatenate only the unique country values, separated by a comma and a space. Otherwise, it simply uses the first country value in the group (country[1]). This ensures that the country values are concatenated only if they are different within each group.

Take a peek at a few rows in `nodes_all_agg’. It contains information of seafood and non-seafood related entities and has 34,442 rows in total.

Show the code
nodes_all_agg %>% arrange(desc(country_qty)) %>% slice(100:105) %>%  kable()
id country_qty type_qty label_qty revenue_sum country_concat type_concat label_concat product_services_concat
Oceanfront Oasis GmbH & Co. KG Carriers 2 1 2 138105.75 Brindivaria, Oceanus Company others, unknown Lotions and other skin care products, cosmetics, perfumes and toilet preparations, and other personal hygiene products| Unknown
Océano del Este ОАО 2 1 2 136636.13 Solovarossa, Oceanus Company seafood, unknown Offers a wide range of products such as salmon and salmon eggs, parr and smolt, and various other species, including cobia and cod juveniles| Unknown
Odisha Ltd. Liability Co 2 1 2 93427.09 Oceanus, Marebak Company unknown, seafood Unknown| A range of fish and other related seafood products| Unknown
Oka Ltd. Corporation Transport 2 1 1 40152.33 Korvelonia, Oceanus Company seafood Fish and fish products| Pink and chum salmon
Ola Azul Ges.m.b.H. Services 2 1 2 59994.96 Nalakond, Zawalinda Company seafood, unknown Seafood product preparation and packaging| Unknown
Ola del Mar SRL 2 1 2 33526.70 Zawalinda, Puerto del Mar Company others, unknown Primarily involved in providing air, surface, or combined courier delivery services of parcels generally between metropolitan areas or urban centers| Unknown

3.2.3 seafood_entities df

In this section, we will subset the nodes_all_agg dataframe above by filtering label == ‘seafood’ to extract only seafood-related entities information.

Filter for ids with label_concat column containing seafood

Take a peek at a few rows in seafood_entities. There are 651 seafood related entities.

id country_qty type_qty label_qty revenue_sum country_concat type_concat label_concat product_services_concat
Bahía de Plata Trout 1 1 1 16349.68 Rio Isla Company seafood Grocery products (Canned and frozen foods, milk, fresh fruits and vegetables, fresh and prepared meats, fish, poultry and poultry products); Other household products (alcohol, household cleaning products, medicine, and clothes)
Bahía del Sol Deep-sea 1 1 1 36095.44 Nalakond Company seafood Fish and seafood products (tuna, salmon, herring, shellfish, and groundfish products; and flounder fillets, cornmeal pollock strips, burger, tuna steak, frozen halibut steaks, as well as canned sockeye salmon and frozen sockeye, and crabs)
Bahía del Sol Kga Consulting 1 1 1 14016.58 Vesperanda Company seafood Seafood products
Baker LLC 1 3 2 11016.82 ZH Company, Beneficial Owner, Company Contacts seafood, unknown Canned fish and seafood products| character(0)| character(0)
Baker and Sons 1 2 2 104095830.23 ZH Company, Beneficial Owner seafood, unknown Fish; fresh or chilled, mackerel (Scomber scombrus, Scomber australasicus, Scomber japonicus), excluding fillets, fish meat of 0304, and edible fish offal of subheadings 0302.91 to 0302.99| character(0)
Balkan GmbH & Co. KG Cargo 1 1 1 29457.13 Nalakond Company seafood Seafood and related products

3.2.4 seafood_edges df

Next, we need to filter only the relevant relationship from the edges_combined df according to seafood_entities df .

Filter the relevant rows from edges_combined df where either of its target or source values are found in id column of seafood_entities dataframe. This would filter out all relationship related to seafood related entities. There are no duplicates in seafood_edges df.

Show the code
# Filter rows in edges_combined df based on matching values of id column in the'seafood_entities' dataframe
seafood_edges <- edges_combined[edges_combined$source %in% seafood_entities$id | edges_combined$target %in% seafood_entities$id, ]

3.2.5 seafood_nodes df

Using seafood_edges df, create a node file called seafood_nodes for network visualisation later.

A left join with nodes_all_agg is required to ingest all the nodes attributes.

Show the code
# Extract unique ids from 'source' and 'target' columns of 'seafood_edges' dataframe
unique_entities <- unique(c(seafood_edges$source, seafood_edges$target))

# Create a new 'seafood_nodes' dataframe as a tibble
seafood_nodes <- as_tibble(data.frame(id = unique_entities))

# Perform left outer join with my nodes master table 'nodes_all_agg' 
seafood_nodes <- left_join(seafood_nodes, nodes_all_agg, by = "id")

4 The analysis (Seafood)

Seafood_graph tbl object

The seafood_nodes and seafood_edges dfs will be the main files used to create a tbl graph object. The tbl_graph function from tidygraph library will be used.

The seafood_graph object has 3114 nodes and 2505 edges. There are 609 subgraphs inside.

In addition, we will calculate various centrality scores and also add a column called membership for us to recognise which subgraph each id belongs to.

Show the code
seafood_graph<- tbl_graph(nodes=seafood_nodes,
                          edges = seafood_edges,
                          directed = FALSE)

seafood_graph <- seafood_graph %>% 
  activate(nodes) %>% 
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = centrality_closeness(),
         eigenvector_centrality = centrality_eigen(),
         membership = components(seafood_graph)$membership)

seafood_graph nodes interactive datatable below to show all the additional centrality and membership attributes calculated using wrapper functions of the tidygraph package. You could sort the various columns inside.

We will build a network graph that contains only seafood_related business relationships. The graph will show only entities with high betweenness scores.

Prepare edge file for visNetwork library. Rename source to from and target to to.

Show the code
seafood_edges_vis <- seafood_edges %>% 
  rename(from = source) %>% 
  rename(to = target) %>% 
  mutate(title = type)

Prepare node file for visNetwork library. By renaming type_concat to group, visnetwork could help us to colour the nodes by business relationship types.

Show the code
seafood_nodes_vis <- seafood_graph %>% 
  activate(nodes) %>% 
  as_tibble() %>% 
  mutate(group = type_concat) %>% 
  mutate(title = paste('id = ',id, "\n ,Betweenness =", betweenness_centrality,"\n ,Revenue =", revenue_sum, "\n ,Country =", country_concat))

Let us visualise the nodes with betweenness score 15 and above.

Show the code
set.seed(1234)
visNetwork(seafood_nodes_vis %>% filter(betweenness_centrality >= 15),
           seafood_edges_vis,
           main = "Interactive Network graph of top betweenness entities",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

All the entities shown above have relatively higher betweenness scores than other entities in the entire network. Hovering the mouse over each node and edge will reveal attributes like betweenness scores, country associated each node with and its revenue.

At a glance, there are two subgraphs (bottom) with 5 high betweenness nodes connected to one another, forming a larger than normal network. We are interested in subgraphs with longer network diameter like these because they represent more complex business relationship. For instance, in such complex subgraphs, entities involved are ‘company’, ‘business owners’ and ‘company contacts’ and there are also more than 1 company in the network. From literature review, having entities associated to two or more companies (conflict of interests could occur) could suggests transshipment activities. This could allows us to investigate the entities in the complex subgraphs to check for any IUU crime.

Instead of having to ‘eye-ball’ each of 609 the seafood network subgraphs to identify those with longer network diameter, we can extract edge data of each subgraph (using membership column) and compile the network diameter of each subgraph into a dataframe.

Calculate network diameter

Show the code
# Get unique membership values
unique_memberships <- unique(seafood_nodes_vis$membership)

# Initialize empty list to store results
results <- list()

# Iterate over each membership value
for (x in unique_memberships) {
  # Filter nodes based on membership
  nodes <- seafood_nodes_vis %>%
    filter(membership == x)
  
  # Filter edges based on nodes
  edges <- seafood_edges_vis %>%
    filter(from %in% nodes$id | to %in% nodes$id)
  
  # Create subgraph
  subgraph <- as_tbl_graph(edges, directed = FALSE)
  
  # Calculate network diameter
  diameter <- with_graph(subgraph, graph_diameter())
  
  # Store results in list
  results[[as.character(x)]] <- diameter
}
# Create DataFrame with membership and network diameter columns
diameter_df <- tibble(
  membership = unique_memberships,
  network_diameter = unlist(results)
)

Let us visualise the distribution of network_diameter of all the seafood subgraphs.

First, let us fix the order of the network diameter field by descending counts of network diameter.

Show the code
diameter_counts <- count(diameter_df, network_diameter)

# Reorder the factor levels of network_diameter based on count
diameter_df$network_diameter1 <- factor(
  diameter_df$network_diameter,
  levels = diameter_counts$network_diameter[order(diameter_counts$n, decreasing = FALSE)]
)
Show the code
d <- highlight_key(diameter_df %>% arrange(desc(network_diameter1))) 


p<-ggplot(data=diameter_df, 
       aes(x=as.factor(network_diameter1))) +
  geom_bar(fill='lightblue') +
  coord_flip() +
  theme_minimal() + 
    labs(title = 'Distribution of Network Diameter',
         x= 'Network diameter') +
  theme(plot.title = element_text(face='bold'))

gg <- highlight(ggplotly(p),        
                "plotly_selected")  

crosstalk::bscols(gg,
                  DT::datatable(d,options = list(iDisplayLength = 5)),
                  widths=5)

There are about 32 subgraphs with network diameter of 4 and above. We can investigate these subgraphs closer.

4.1 Subgraphs with high NETWORK RISK

First, let us ingest the network diameter of each subgraph into the main seafood_nodes_vis dataframe by performing a left join with diameter_df using membership columns in both df as join key.

Show the code
seafood_nodes_vis<- seafood_nodes_vis %>% 
  left_join(diameter_df, by='membership',
            unmatched = "drop") %>% 
  arrange(desc(network_diameter), desc(betweenness_centrality)) %>% 
  select(id,membership,network_diameter,betweenness_centrality,closeness_centrality,eigenvector_centrality,revenue_sum,country_qty,type_qty,label_qty,country_concat,type_concat,label_concat,product_services_concat,group,title)

Is there a difference in the betweenness scores across network diameter?

Show the code
ggbetweenstats(data = seafood_nodes_vis, x = network_diameter, y = betweenness_centrality,
               xlab = "network_diameter", ylab = "betweenness_centrality",
               type = "np", pairwise.comparisons = TRUE, pairwise.display = "s",
               sort = "descending",
               sort.fun = median,
               mean.ci = T, p.adjust.method = "fdr",  conf.level = 0.95,
               title = "Comparison of Betweenness centrality across different network diameters") +
  scale_y_continuous(limits = c(0, 2500)) +
   theme(axis.title.y=element_text(angle = 0,
                                  vjust=0.9))

From the plot above, p value< 0.05 and we have evidence to conclude that the betweenness scores across network of different diameter are different. There are many entities with very high betweenness scores in subgraphs with network_diameter of 2, 4 and 6. However, in this section we will only focus on subgraphs with network_diameter of 4 and above because of more complex business relationships within them.

Next filter the subgraphs where network diameter is 4 and above.

Show the code
network_risk_nodes_vis<- seafood_nodes_vis %>% 
  filter(network_diameter >= 4) %>% 
  arrange(desc(network_diameter), desc(betweenness_centrality))

There are only 2 subgraphs with network diameter of 6. They are subgraphs 112 and 227.

Show the code
#network_risk_nodes_vis %>% select(id,membership, network_diameter) %>%  head(10) %>% kable()

DT::datatable(network_risk_nodes_vis ,options = list(iDisplayLength = 5))

Visualisation of the two subgraphs with highest network diameter.

Filter for all the nodes and edges in subgraph 112 from seafood_nodes_vis dataframe.

Show the code
m<- 112
sub112_nodes_vis <- seafood_nodes_vis %>%  filter(membership==m)
sub112_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub112_nodes_vis$id | seafood_edges_vis$to %in% sub112_nodes_vis$id, ]

Visualise the network graph 112.

Show the code
set.seed(1234)
visNetwork(sub112_nodes_vis, #%>% filter(betweenness_centrality >= 20),
           sub112_edges_vis,
           main = "Network graph of subgraph 112 with diameter =6",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

In this subgraph, we can see three companies (blue) linked together by two individuals, namely Andrew Reed and John Hernandez.

Andrew Reed is a beneficial owner to both ‘Adair S.A. de C.V.’ and ‘Oka  Ltd. Corporation Transport’.

John Hernandez is a Company contacts of ‘Danish Plaice Swordfish AB Shipping’ and beneficial owner of ‘Adair S.A. de C.V.’

The datatable containing details of the members in subgraph 112 allows us to filter them by betweenness scores and other attributes.

Show the code
DT::datatable(seafood_nodes_vis %>%  filter(membership==m), options = list(iDisplayLength = 5))

Filter for all the nodes and edges in subgraph 227 from seafood_nodes_vis dataframe

Show the code
m<- 227
sub227_nodes_vis <- seafood_nodes_vis %>%  filter(membership==m)
sub227_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub227_nodes_vis$id | seafood_edges_vis$to %in% sub227_nodes_vis$id, ]

Visualising the subgraph 227

Show the code
set.seed(1234)
visNetwork(sub227_nodes_vis, #%>% filter(betweenness_centrality >= 20),
           sub227_edges_vis,
           main = "Network graph of subgraph 227 with diameter =6",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

In this subgraph, we can see three companies linked by two individuals, namely Christopher Rodrigues and Lisa Brown.

Lisa Brown is beneficial owner to both ‘Tide NV solutions’ and ‘Deep Blue Cargo ship’.

Christopher Rodriguez is a beneficial owner of ‘Deep Blue Cargo ship’ and a company contacts of ‘Lewis PLC’. It seems to suggest that Deep Blue Cargo ship’ could have business dealings with ‘Lewis PLC’ via Christopher.

The datatable containing details of the members in subgraph 227 allows us to filter them by betweenness scores and other attributes.

Show the code
DT::datatable(seafood_nodes_vis %>%  filter(membership==m) , options = list(iDisplayLength = 5))

4.2 High FINANCIAL RISK

In this section, we will look out for subgraphs with exceptionally high total revenue. As we expect bigger subgraphs to have higher revenue, thus the total revenue will be divided by the number of companies in the subgraph for fair comparison.

The code chunk below first group by membership and network_diameter columns to calculate sum of revenue called total_revenue_subgraph. It then sum up the number of times ‘Company’ appeared under the type_concat column to give us the number of distinct companies inside the subgraph. Finally, the total revenue for each subgraph is divided by the number of companies to give revenue_per_company field.

Show the code
financial_risk_nodes <- seafood_nodes_vis %>%
  group_by(membership, network_diameter) %>%
  summarize(
    total_revenue_subgraph = as.numeric(sum(revenue_sum, na.rm=TRUE)),
    no_of_companies_subgraph = sum(str_count(type_concat, "\\bCompany\\b")) - sum(str_count(type_concat, "\\bCompany Contacts\\b"))) %>% 
  ungroup() %>%
mutate(
    revenue_per_company = ifelse(no_of_companies_subgraph == 0, total_revenue_subgraph, round(total_revenue_subgraph / no_of_companies_subgraph))) %>% 
  arrange(desc(revenue_per_company))

The summary function shows signs of extreme outliers revenue_per_company value of $291,436,839. We might have to consider using log scale for box plots.

Show the code
summary(financial_risk_nodes$revenue_per_company)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
        0     12822     32167   1265696     72367 291436839 

Visualising the distribution of revenue per company across different network diameters.

Show the code
library(scales)

ggplotly(ggplot(data = financial_risk_nodes %>% 
                  mutate(network_diameter=as.factor(network_diameter)), 
                aes(x = reorder(network_diameter, 
                                -revenue_per_company, 
                                median), 
                    y = revenue_per_company, 
                    fill=network_diameter)) +
           
   geom_boxplot(outlier.colour="blue", 
                outlier.size=1) +
     
   geom_point(aes(label=membership),
              position = 'jitter',
              size=0.5) +
     
   stat_summary(fun.y=mean, 
                geom="point", 
                shape=20, 
                size=2.5, 
                color="pink", 
                fill="red") +
   xlab("Network diameter size") +
   ylab("Revenue per company") +
   ggtitle("Revenue per company by network diameter size") +
   #scale_fill_brewer(palette='Set2') +
   theme(plot.title = element_text(face= 'bold',
                                   hjust = 0.5),
         legend.position = 'none') +
     
   scale_y_continuous(trans = log10_trans()))
Show the code
    #breaks = trans_breaks("log10", function(x) 10^x),
    #labels = trans_format("log10", math_format(10^.x))))

Hovering the mouse above the points will reveal their subgraph membership numbers. Surprisely, there are more smaller network diameter groups with higher revenue per company.

The top 10 subgraphs with the highest total revenue per company are listed below. The network diameters are mainly 1 or 2.

revenue_per_company membership network_diameter no_of_companies_subgraph
291436839 396 1 1
131450837 577 2 0
95809780 316 2 1
63153107 173 2 1
55376193 193 1 0
52053645 162 4 2
32183079 581 2 1
1507514 498 2 1
1216029 553 2 1
1205868 48 1 1

Visualising the top 2 subgraphs in terms of revenue per company.

Subgraph 193

Show the code
m<- 396
sub227_nodes_vis <- seafood_nodes_vis %>%  filter(membership==m)
sub227_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub227_nodes_vis$id | seafood_edges_vis$to %in% sub227_nodes_vis$id, ]

set.seed(1234)
visNetwork(sub227_nodes_vis, #%>% filter(betweenness_centrality >= 20),
           sub227_edges_vis,
           main = "Network graph of subgraph 396 with diameter = 1",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_nicely") %>%
  visEdges(smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

In this subgraph, Morgan Group commands a very high revenue of 291 million with only one business relationship with Jason Cole who is its Beneficial owner. Morgan Group sells fish (smoked or not smoked) products.

Information about the two members:

Show the code
DT::datatable(seafood_nodes_vis %>%  filter(membership==m))

Subgraph 498

Show the code
m<- 577
sub227_nodes_vis <- seafood_nodes_vis %>%  filter(membership==m)
sub227_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub227_nodes_vis$id | seafood_edges_vis$to %in% sub227_nodes_vis$id, ]

set.seed(1234)
visNetwork(sub227_nodes_vis, #%>% filter(betweenness_centrality >= 20),
           sub227_edges_vis,
           main = "Network graph of subgraph 577 with diameter = 2",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

In this subgraph, the total revenue of the three individuals is 131 million recorded under the name of ‘WIlson LLC’. The product service offered by this company is ‘Fish, dried but not smoked’. ‘Wilson LLC’ seems like a company than company contacts or beneficial owner.

Information about the members:

Show the code
DT::datatable(seafood_nodes_vis %>%  filter(membership==m),options = list(iDisplayLength = 5))

4.3 High COUNTRY RISK

In this section, we will focus on subgraphs with business relationship involving many countries. This could usually indicate transshipment across carriers from different countries.

According to literature reviews, a method of IUU is via “Flags of convenience” where vessels fly different country flags at different location to avoid inspections and intersections by local governing bodies.

The code chunk below is a continuation from section 4.3. It create a new column no_of_countries_subgraph where it computes the number of distinct values of countries present in each subgraph.

Show the code
country_risk_nodes <- seafood_nodes_vis %>%
  group_by(membership, network_diameter) %>%
  summarize(
    total_revenue_subgraph = as.numeric(sum(revenue_sum, na.rm=TRUE)),
    no_of_companies_subgraph = sum(str_count(type_concat, "\\bCompany\\b")) - sum(str_count(type_concat, "\\bCompany Contacts\\b")),
  no_of_countries_subgraph = n_distinct(na.omit(unlist(strsplit(country_concat, ", "))))) %>%
  ungroup() %>%
mutate(
    revenue_per_company = ifelse(no_of_companies_subgraph == 0, total_revenue_subgraph, round(total_revenue_subgraph / no_of_companies_subgraph))) %>% 
  arrange(desc(revenue_per_company))

We will use a bar chart to visualise the count of countries in the business relationship.

Show the code
d <- highlight_key(country_risk_nodes %>% arrange(desc(no_of_countries_subgraph))) 


p<-ggplot(data=country_risk_nodes, 
       aes(x=as.factor(no_of_countries_subgraph))) +
  geom_bar(fill = 'lightblue') +
  coord_flip() +
  theme_minimal() + 
    labs(title= 'Distribution of \nnumber of countries \nin subgraphs',
         x= 'Number of countries involved in a subgraph')

gg <- highlight(ggplotly(p),        
                "plotly_selected")  

crosstalk::bscols(gg,
                  DT::datatable(d,options = list(iDisplayLength = 5)),
                  widths=5)

Subgraph 6 is associated with the highest number of countries (10).

Subgraph 85 is associated with 4 different countries.

There are 15 other subgraphs that are associated with 3 different countries.

Subgraph 6 is very special because of the relationship between more than 10 countries so let us zoom in on its network graph.

Show the code
m<- 6
sub227_nodes_vis <- seafood_nodes_vis %>%  filter(membership==m)
sub227_edges_vis <- seafood_edges_vis[seafood_edges_vis$from %in% sub227_nodes_vis$id | seafood_edges_vis$to %in% sub227_nodes_vis$id, ]

set.seed(1234)
visNetwork(sub227_nodes_vis, #%>% filter(betweenness_centrality >= 20),
           sub227_edges_vis,
           main = "Network graph of subgraph 6 with diameter = 4",
           height = "500px", width = "100%") %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(smooth = list(enables = TRUE,
                         type= 'straightCross'),
           shadow = FALSE,
           dash = FALSE) %>% 
  visOptions(highlightNearest = list(enabled = T, degree = 1, hover = T),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visInteraction(hideEdgesOnDrag = TRUE) %>% 
  visLegend() %>%
  visLayout(randomSeed = 123)

In the above subgraph, there are two companies ‘Aqua Aura SE Marine life’ registered in countries (Mawazam, Rio Isla, Icarnia, Oceanus, Nalakond, Coralmarica, Alverossia, Isliandor, Talandria) and ’ BlueWater Bistro GmbH Industrial’ company registered in (Marebak).

Information about its members:

Show the code
DT::datatable(seafood_nodes_vis %>%  filter(membership==m) , options = list(iDisplayLength = 5))

4.4 Finding similar subgraphs (groups)

In this section, we will use parallel coordinate plot to visualise and analyse multivariate, numerical data we have of each subgraphs. We will be comparing multiple variables

  • no_of_companies_subgraph : the number of distinct companies in the subgraph

  • no_of_beneficial_owners_subgraph: the number of beneficial owners in the subgraph

  • top_betweenness : the highest betweenness score of an entity in the subgraph

  • revenue_per_company: the total revenue of the subgraph divided by the total number of companies in subgraph

Preparing the dataframe for the plot by creating two more new columns to calculate the number of beneficial owners and top betweenness score in each subgraph.

Show the code
prll <- seafood_nodes_vis %>%
  group_by(membership, network_diameter) %>%
  summarize(
    total_revenue_subgraph = as.numeric(sum(revenue_sum, na.rm=TRUE)),
    no_of_companies_subgraph = sum(str_count(type_concat, "\\bCompany\\b")) - sum(str_count(type_concat, "\\bCompany Contacts\\b")),
  no_of_countries_subgraph = n_distinct(na.omit(unlist(strsplit(country_concat, ", ")))),
  no_of_beneficial_owners_subgraph = sum(str_count(type_concat, "Beneficial Owner|Beneficial Owner, Company Contacts")),
  top_betweenness = max(betweenness_centrality)) %>%
  ungroup() %>%
mutate(
    revenue_per_company = ifelse(no_of_companies_subgraph == 0, total_revenue_subgraph, round(total_revenue_subgraph / no_of_companies_subgraph)),
    membership = as.factor(membership)) %>% 
  arrange(desc(revenue_per_company))

We will filter subgraphs with network diameter 3 and above. The graph will also be faceted by the number of countries involvment in each subgraph. (Refer to the subtitle for the meaning of each box)

Show the code
set.seed(1234)
ggparcoord(prll %>% 
             filter(as.numeric(network_diameter) >=3) %>%
             mutate(no_of_countries_subgraph=as.factor(no_of_countries_subgraph),
                    network_diameter=as.factor(network_diameter)),
           columns = c(4,6:8),
           groupColumn = 2,
           scale = "uniminmax",
           alphaLines = 0.3,
           boxplot = TRUE, 
           title = "Parallel Coordinates Plot of subgraph variables") +
  theme(plot.title = element_text(face='bold'),
        legend.position = "right",
        legend.text = element_text(size = 8),
        legend.title = element_text(size = 10),
        axis.text.x = element_text(angle = 35, hjust = 1),
        axis.title.y = element_blank()) +
  labs(subtitle='Box 1: 1 country, Box 2: 2 countries, Box 3: 3 countries, Box 4: 10 countries') +
  
  facet_wrap(~ `no_of_countries_subgraph`)

Box 1 shows that there is one subgraph with network diameter =3 that has only 1 country in the relationship.

In box 2, the subgraphs with higher network diameter tend to have more beneficial owners in the relationship.

In box 3, there are subgraphs with more number of companies in the relationship. The number of beneficial owner, top betweenness score and revenue per company is also slightly higher than in box 2.

Box 4 shows that there is one subgraph with network diameter =4 that has 10 countries in the relationship. The number of beneficial owners and top betweenness scores in te subgraph are also observed to be the highest amongst all the boxes.

5 Conclusion

From the anomalies raised in the earlier sections, it would be worth investigating subgraphs with bigger network diamaters , higher revenue per company and subgraphs with multiple countries presence.

6 REFERENCES

Kam, T. S. (2022, December 4). R for Data Science. r4va. Retrieved from https://r4va.netlify.app/

Truelove, J. (2021, September 19). Hive plots with the ggraph and hiver packages. GitHub Pages. https://jtr13.github.io/cc21fall2/hive-plots-with-the-ggraph-and-hiver-packages.html

Nowak, B. J. (2021, September 30). R Network Analysis with Tidygraph. Netlify. https://bjnnowak.netlify.app/2021/09/30/r-network-analysis-with-tidygraph/

Ross, Z. (2019, April 2). Easy multi-panel plots in R using facet_wrap and facet_grid from ggplot2. ZevRoss Spatial Analysis Blog. http://zevross.com/blog/2019/04/02/easy-multi-panel-plots-in-r-using-facet_wrap-and-facet_grid-from-ggplot2/